Time-space trade-offs for Lempel-Ziv compressed indexing

Authors

  • Philip Bille
  • Mikko Berggren Ettienne
  • Inge Li Gørtz
  • Hjalte Wedel Vildhøj
Abstract

Given a string $S$, the compressed indexing problem is to preprocess $S$ into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of $S$ while supporting fast queries. We present a compressed index based on the Lempel–Ziv 1977 compression scheme. We obtain the following time-space trade-offs. For constant-sized alphabets: (i) $O(m + \mathrm{occ} \lg\lg n)$ time using $O(z \lg(n/z) \lg\lg z)$ space, or (ii) $O\left(m\left(1 + \frac{\lg^{\epsilon} z}{\lg(n/z)}\right) + \mathrm{occ}(\lg\lg n + \lg^{\epsilon} z)\right)$ time using $O(z \lg(n/z))$ space. For integer alphabets polynomially bounded by $n$: (iii) $O\left(m\left(1 + \frac{\lg^{\epsilon} z}{\lg(n/z)}\right) + \mathrm{occ}(\lg\lg n + \lg^{\epsilon} z)\right)$ time using $O(z(\lg(n/z) + \lg\lg z))$ space, or (iv) $O(m + \mathrm{occ}(\lg\lg n + \lg z))$ time using $O(z(\lg(n/z) + \lg z))$ space. Here $n$ and $m$ are the lengths of the input string and the query string respectively, $z$ is the number of phrases in the LZ77 parse of the input string, $\mathrm{occ}$ is the number of occurrences of the query in the input, and $\epsilon > 0$ is an arbitrarily small constant. In particular, (i) improves the leading term in the query time of the previous best solution from $O(m \lg m)$ to $O(m)$ at the cost of increasing the space by a factor $\lg\lg z$. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of $O\left(m\left(1 + \frac{\lg^{\epsilon} z}{\lg(n/z)}\right)\right)$. However, for any polynomial compression ratio, i.e., $z = O(n^{1-\delta})$ for constant $\delta > 0$, this becomes $O(m)$. Our index also supports extraction of any substring of length $\ell$ in $O(\ell + \lg(n/z))$ time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak prefix search.
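To make the parameter z concrete, the following is a minimal sketch of a greedy LZ77 parse and its inverse. It assumes the LZ77 variant in which each phrase copies the longest prefix of the remaining text that also starts at an earlier position (self-overlap allowed) and then appends one explicit character. The function names `lz77_parse` and `lz77_decode` are hypothetical, the parser is a naive brute-force search, and none of this is the index construction from the paper; it only illustrates how z is counted.

```python
def lz77_parse(s):
    """Greedy LZ77 parse of s into (source, length, char) triples.

    Assumption: each phrase is the longest match starting at an earlier
    position (it may overlap the phrase itself) followed by one explicit
    character. Brute force, so only suitable for short strings.
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best_src, best_len = 0, 0
        for src in range(i):
            length = 0
            # Stop one character early so an explicit character remains.
            while i + length < n - 1 and s[src + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_src, best_len = src, length
        phrases.append((best_src, best_len, s[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decode(phrases):
    """Rebuild the string from the triples produced by lz77_parse."""
    out = []
    for src, length, ch in phrases:
        # Copy character by character so self-overlapping sources work.
        for k in range(length):
            out.append(out[src + k])
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    text = "abababab"
    parse = lz77_parse(text)
    print(parse, "z =", len(parse))
    print(lz77_decode(parse) == text)
```

Under these assumptions, highly repetitive strings produce few phrases relative to their length, which is the regime where space bounds stated in terms of z are small compared with n.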

Similar resources

Fast Lempel-Ziv Decompression in Linear Space

We consider the problem of decompressing the Lempel-Ziv 77 representation of a string S ∈ [σ]^n using a working space as close as possible to the size z of the input. The folklore solution for the problem runs in optimal O(n) time but requires random access to the whole decompressed text. A better solution is to convert LZ77 into a grammar of size O(z log(n/z)) and then stream S in optimal linear...

Universal Compressed Text Indexing

The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the ru...

Efficient Algorithms for Lempel-Ziv Encoding

We consider several basic problems for texts and show that if the input texts are given by their Lempel-Ziv codes then the problems can be solved deterministically in polynomial time in the case when the original (uncompressed) texts are of exponential size. The growing importance of massively stored information requires new approaches to algorithms for compressed texts without decompressing. D...

CHICO: A Compressed Hybrid Index for Repetitive Collections

Indexing text collections to support pattern matching queries is a fundamental problem in computer science. New challenges keep arising as databases grow, and for repetitive collections, compressed indexes become relevant. To successfully exploit the regularities of repetitive collections different approaches have been proposed. Some of these are Compressed Suffix Array, Lempel-Ziv, and Grammar...

Indexing Highly Repetitive Collections

The need to index and search huge highly repetitive sequence collections is rapidly arising in various fields, including computational biology, software repositories, versioned collections, and others. In this short survey we briefly describe the progress made along three research lines to address the problem: compressed suffix arrays, grammar compressed indexes, and Lempel-Ziv compressed indexes.

Journal:
  • Theor. Comput. Sci.

Volume 713

Publication date 2017